Building intuition into popgen fundamentals and inference
Exercises for the Workshop on population and speciation genomics
Reference materials
As you will be working on the exercises below (all of which focus on the popgen simulation R package slendr), some reference materials will be useful:
First and foremost, you should refer to the slides with the slendr “crash course” which we started our activity session with, and which cover pretty much every bit of slendr functionality you will need. You can find the material rendered as normal slides or one-page handouts (the latter will be a bit more practical for reference, as you won’t have to search for information by flipping through individual slides). If these Quarto Pub links are not working, it could be because their servers are overloaded. In this case, you can simply open slides.html or handouts.html from the GitHub repository for this activity (you will also have them in the cloud – instructions below).
Manual pages of all available slendr functions. Note that you can get the help page of every slendr R function() by typing ?function in the R console, and read it directly in RStudio. For instance, typing ?ts_tajima gives you the help page of the tskit/slendr function implementing the tree-based computation of Tajima’s D. Whenever I discuss an optional parameter of some function (or whenever you want to modify a function’s behavior), this is where you can find more information about it.
Relevant tutorials on the slendr website, which you can find under the “Articles” link in the header of the webpage. In general, the website contains much more detailed information than what is available in the handouts in a much more condensed form. I think this is probably too much information for this activity, but don’t hesitate to read a bit more if you’re interested!
Installation setup
If you’re using the RStudio browser-based cloud instance provided by the workshop organizers, you don’t actually have to do anything! Just log in to the RStudio session in your browser and navigate to the directory for this activity via File -> Open Project... in the RStudio menu (selecting the cesky-krumlov-2025.Rproj project file in that directory will open a fresh R session for that project in your RStudio).
If you want to open the most up-to-date resources from the cloud instance directly (rather than from the quarto.pub hosting service accessible via the links to handouts.html and exercises.html above), you might want to run this inside the directory with this activity’s files on the cloud:
# first go to the directory with the activity files
cd <path to the activity directory on the cloud>
# then pull the latest changes from GitHub
git pull https://github.com/bodkan/cesky-krumlov-2025 main
Organization of the exercises
For each exercise, you will get a brief explanation of the problem at hand and some information about functions that could be useful to solve the exercise. The concrete, specific task will always be written like this in bold. As you work on each part of each exercise, look for these bold lines.
Your goal for each exercise is to write a complete R script (in RStudio, File -> New File -> R Script). I suggest you save the script for each exercise as exercise1.R, exercise2.R, etc., just to keep things tidy and easy to troubleshoot if needed.
Each exercise is composed of individual parts, which are designed to build one upon the other in the order they are specified. As you progress through sequential parts of each exercise, you will be adding code to your script for that exercise.
Don’t worry about the number of exercises you’re “supposed to do”
How far we manage to crunch through the content of this session will depend on everybody’s level of confidence in programming and population genetics. The contents of the activities as a whole are designed so that no matter the moment at which we run out of time, you will take home enough knowledge of slendr to be able to immediately apply it to your own projects.
To avoid anyone getting overwhelmed, I’ll be stopping the activity at some “checkpoints”, to check in with how everything is going and quickly go through the example solutions for the parts until that point, and give a quick explanation.
If you find yourself way ahead, there are bonus exercises (which I will not be going through with everyone) – those should provide entertainment and a bit more challenge for you while the rest of us are catching up. If you’re really far ahead, just continue to the next exercise without waiting for the rest of us.
Note on the programming aspect of the exercises
All the exercises will involve “real coding”! If you’ve never really programmed entire scripts before, this could feel a little intimidating. Don’t worry. If you’re ever lost, just take a peek at the solutions, which are hidden (by default) under each exercise part. Always try to work out a solution on your own, but never let this be a barrier to your progress. Feel free to copy-paste bits of my solutions into your own scripts.
If you find yourself totally lost, don’t hesitate to read my solutions from the get go, copy-pasting them into your own script in the RStudio, executing them line by line, and trying to understand what’s going on. It’s the understanding that we’re after here. Whether or not you can implement everything yourself from scratch does not actually matter at all. (Nobody in history has learned to program from scratch. Everyone started by copy-pasting code examples written by someone else.)
Exercise 1: Programming demographic models with slendr
Part 1: Building a demographic model in slendr
Use functions such as population(), gene_flow(), and compile_model(), which we discussed in the “crash course” at the start of this session, to program the following toy model of human demographic history in slendr. (Apologies for my bad handwriting and poor artistic ability.)
Note: You could easily program the model so that different ancestral populations are represented by separate population() commands (i.e., your model would start with a population called “human_chimp_ancestor”, from which “CHIMP” and “hominin_ancestor” populations would split at 6 Mya, etc.), but generally this is too annoying to do and requires too much code.
Feel free to write the model so that “CHIMP” is the first population, then “AFR” population splits from it at 6 Mya, etc… Although it probably isn’t the most accurate way to describe the real evolutionary history, it simplifies the coding significantly.
[Mya = million years ago; kya = thousand years ago]
Hint: Start your exercise1.R script in your RStudio session using the following “template”. Then add a sequence of appropriate population() calls using the syntax from the introductory slides (using the parent = <pop> argument for programming splits of daughter populations – which will be all except the CHIMP lineage in our example), etc.:
library(slendr)
init_env()
<... population definitions ...>
<... gene-flow definition ...>
model <- compile_model(
  populations = list(...),
  gene_flow = <...>,
  generation_time = 30
)
Note: With slendr you can specify time in whatever format is more convenient or readable for your model. For instance here, because we’re dealing with historical events which are commonly expressed in times given as “years ago”, we can write them in decreasing order – i.e. 7 Mya → 6 Mya → …, as shown above – or, in terms of R code, 7e6 (or 7000000), 6e6 (6000000), etc.
In a later example, you will see that you can also encode the events in the time direction going “forward” (i.e., the first event starting in generation 1, a following event in generation 42, and so on).
Hint: Remember that slendr is designed with interactivity in mind! When you write a chunk of code (such as a command to create a population through a population split, or a model compilation to create a model object), execute that bit of code in the R console and inspect the summary information printed by evaluating the respective R object you just created. You can either copy-paste stuff from your script to the R console, or use a convenient RStudio shortcut like Ctrl+Enter (Linux and Windows) or Cmd+Enter (Mac).
Part 2: Inspecting the model visually
To visualize a slendr model, you use the function plot_model(). Plot your compiled model to make sure you programmed it correctly! Your figure should roughly correspond to my doodle above.
Note: Plotting of models in slendr can be sometimes a little wonky, especially if many things are happening at once. When plotting your model, experiment with arguments log = TRUE, proportions = TRUE, gene_flow = TRUE. Check ?plot_model for more information on these.
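For instance, here is a minimal sketch (assuming your compiled model is stored in a variable model); mix and match the arguments to see which combination makes your figure most readable:

```r
# A log-scaled time axis often helps when ancient splits (Mya) and recent
# events (kya) appear in the same figure; gene_flow = TRUE draws gene-flow
# arrows, proportions = TRUE scales populations by their size
plot_model(model, log = TRUE, proportions = TRUE, gene_flow = TRUE)
```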
Part 3: Simulating genomic data
Once you have a compiled slendr model stored in an R variable (from now on, model will always mean a variable containing a compiled slendr model object relevant for the given exercise, for simplicity), we can simulate data from it. By default, slendr models always produce a tree sequence.
Note: Tree sequences provide an extremely efficient means to store and work with genomic data at a massive scale. However, you can always get simulated data even in traditional file formats, such as VCF, EIGENSTRAT, or a plain old table of ancestral/derived genotypes.
In this activity we will be working only with tree sequences, because it’s much easier and faster to get interesting statistics from them directly in R.
There are two simulation engines built into slendr implemented by functions msprime() and slim(). For traditional, non-spatial, neutral demographic models, the engine provided by the msprime() function is much more efficient, so we’ll be using that for the time being. However, from a popgen theoretical perspective, both simulation functions will give you the same results for any given compiled slendr model (up to some level of stochastic noise, of course).
Note: Yes, this means you don’t have to write any msprime (or SLiM) code to simulate data from a slendr model!
Here’s how you can use the function to simulate a tree sequence from the model you’ve just created using compile_model() in your script:
ts <- msprime(
model,
sequence_length = <length of sequence to simulate [as bp]>,
recombination_rate = <uniform recombination rate [per bp per generation]>
)
You will be seeing this kind of pattern over and over again in this exercise, so it’s a good idea to keep it in mind.
Hint: The msprime() function also has arguments debug and run which can be extremely useful for debugging.
Simulate a tree sequence from your compiled model using the msprime() engine, storing it to a variable ts as shown right above. Use sequence_length = 1e6 (so 1 Mb of sequence) and recombination_rate = 1e-8 (crossover events per base pair per generation). Then experiment with setting debug = TRUE (this prints out msprime’s own debugging summary which you might already be familiar with from your previous activity?) and then run = FALSE (this prints out a raw command-line which can run a slendr simulation in the shell).
Part 4: Inspecting the tree-sequence object
As we will see later, slendr provides an R-friendly interface to a subset of tskit’s functionality for working with tree sequences and for computing various popgen statistics.
For now, type out the ts object in the terminal – what do you see? You should get a summary of a tree-sequence object that you’re familiar with from your msprime and tskit activity earlier in the day.
This is a very important feature of slendr – when a simulation is concluded (regardless of whether it was a slim() or msprime() simulation), you will get a normal tskit object. In fact, the fact that slendr supports (so far, and likely always) only a “subset” of all of tskit’s functionality doesn’t stop you from writing custom Python/tskit code to process a tree sequence generated from a slendr model. Under the hood, a slendr simulation really is just an msprime (or SLiM) simulation! It’s just executed through a simplified (and dare I say, for some purposes, a bit more convenient) interface.
The brilliance of the tree-sequence data structure rests on its elegant table-based implementation (much more information on that is here). slendr isn’t really designed to run very complex low-level manipulations of tree-sequence data (its strength lies in the convenient interface to popgen statistical functions implemented by tskit), but it does contain a couple of functions which can be useful for inspecting the lower-level nature of a tree sequence. Let’s look at a couple of them now.
Use the ts_table() function to inspect the low-level table-based representation of a tree sequence. For instance, you can get the table of nodes with ts_table(ts, "nodes"), edges with ts_table(ts, "edges"), and do the same thing for “individuals”, “mutations”, and “sites”. Does your tree sequence contain any mutations? If not, why, and how can we even do any popgen with data without any mutations? As you’re doing all this, take a look at the following figure (this was made from a different tree sequence than you have, but that’s OK) to help you relate the information in the tables to the tree sequence which those tables (particularly the tables of nodes and edges) implicitly encode.
There are also two slendr-specific functions called ts_samples() (which retrieves the “symbolic names” and dates of all recorded individuals at the end of a simulation) and ts_nodes(). You can run them simply as ts_samples(ts) and ts_nodes(ts). How many individuals (samples) are in your tree sequence as you simulated it? How is the result of ts_nodes() different from ts_samples()?
Part 5: Scheduling sampling events
In the table produced by the ts_samples() function, you saw that the tree sequence we simulated recorded everyone. It’s very unlikely, unless we’re extremely lucky, that we’ll ever have a sequence of every single individual in a population that we study. To get a little closer to the scale of the genomic data that we usually work with on a day-to-day basis, we can restrict our simulation to only record a subset of individuals.
We can precisely define which individuals (from which populations, and at which times) should be recorded in a tree sequence using the slendr function schedule_sampling(). For instance, if we have a model with some slendr populations in variables eur and afr, we can schedule the recording of 5 individuals from each at times 10000 (years ago) and 0 (present-day) (using the “years before present” direction of time in our current model of Neanderthal introgression) with the following code:
pop_schedule <- schedule_sampling(model, times = c(10000, 0), list(eur, 5), list(afr, 5))
This function simply returns a data frame. As such, we can create multiple such schedules (of arbitrary complexity and granularity), and then bind them together into a single sampling schedule with a single line of code, like this:
# Note that the `times =` argument can be a vector of times like here...
ancient_times <- c(40000, 30000, 20000, 10000)
eur_samples <- schedule_sampling(model, times = ancient_times, list(eur, 1))
# ... but also just a single number like here
afr_samples <- schedule_sampling(model, times = 0, list(afr, 1), list(eur, 42))
nea_samples <- schedule_sampling(model, time = 60000, list(nea, 1))
# And that multiple such sampling schedules can be bound together like this
schedule <- rbind(eur_samples, afr_samples, nea_samples)
schedule
Using the function schedule_sampling() (and with the help of rbind() as shown in the previous code chunk), program the sampling of the following sample sets at given times, saving it to a variable called schedule:
| time | population(s) | # individuals |
|---|---|---|
| 70000 | nea | 1 |
| 40000 | nea | 1 |
| 0 | chimp, afr, eur | 5 |
Additionally, schedule the sampling of one eur individual at these times:
t <- seq(40000, 2000, by = -2000)
Note: You can provide a vector variable (such as t in this example) as the times = argument of schedule_sampling().
In total, you should schedule the recording of 38 individuals.
Then, verify the correctness of your overall sampling schedule by visualizing it together with your model like this:
Note: As you’ve seen above, the visualization is often a bit wonky and convoluted with overlapping elements and it can be even worse with samples added, but try to experiment with arguments to plot_model described above to make the plot a bit more helpful for sanity checking.
plot_model(model, samples = schedule)
Part 6: Simulating a defined set of individuals
You have now both a compiled slendr model and a well-defined sampling schedule.
Use your combined sampling schedule stored in the schedule variable to run a new tree-sequence simulation from your model (again using the msprime() function), this time restricted to just those individuals scheduled for recording. You can do this by providing the combined sampling schedule as the samples = schedule argument of the function msprime you used above. Don’t worry about overwriting the previous contents of ts <- msprime(...) in your script, just replace your old msprime() call with the new one which uses the schedule for your customized sampling.
Also, while you’re doing this, use the ts_mutate() function to overlay neutral mutations onto the simulated tree sequence right after the msprime() call. (Take a look at the handouts for a reminder of the %>% pipe patterns I showed you.)
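As a sketch, the piped call could look something like this (the mutation rate of 1e-8 used here is just an illustrative assumption; pick whatever value suits your model):

```r
# Simulate only the scheduled individuals, then overlay neutral mutations
ts <- msprime(
  model,
  samples = schedule,
  sequence_length = 1e6,
  recombination_rate = 1e-8
) %>%
  ts_mutate(mutation_rate = 1e-8)
```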
Inspect the tree-sequence object saved in the ts variable by typing it into the R console again (this interactivity really helps with catching nasty bugs early during the programming of your script). You can also do a similar thing via the table produced by the ts_samples() function. You should see a much smaller number of individuals being recorded, indicating that the simulation was much more efficient and produced genomic data for only the individuals of interest.
Note: When you think about it, it’s actually quite astonishing how fast msprime and tskit are when dealing with such a huge amount of sequence data from tens of thousands of individuals on a simple laptop!
Exercise 2: Computing popgen statistics on tree sequences from slendr
In this exercise, you will build on top of the results from Exercise 1. Specifically, we will learn how to compute popgen statistics on slendr-simulated tree sequences using slendr’s interface to the tskit Python module.
First, create a new R script exercise2.R and paste in the following code. This is one of the possible solutions to Exercise 1, and it’s easier if we all use it so that we’re on the same page from now on, starting from the same model and the same simulated tree sequence:
library(slendr)
init_env()
## The interface to all required Python modules has been activated.
chimp <- population("CHIMP", time = 7e6, N = 5000)
afr <- population("AFR", parent = chimp, time = 6e6, N = 15000)
eur <- population("EUR", parent = afr, time = 70e3, N = 3000)
nea <- population("NEA", parent = afr, time = 600e3, N = 1000, remove = 40e3)
gf <- gene_flow(from = nea, to = eur, rate = 0.03, start = 55000, end = 50000)
model <- compile_model(
populations = list(chimp, nea, afr, eur),
gene_flow = gf,
generation_time = 30
)
# We will read a cached version of a tree sequence to make sure we're all on
# the same page. That said, if you managed to do Exercise 1 on your own, feel
# free to stick with your own simulated tree sequence!
ts <- ts_read(here::here("data/introgression.trees"), model = model)
cowplot::plot_grid(
plot_model(model, proportions = TRUE),
plot_model(model, proportions = TRUE, log = TRUE),
nrow = 1
)
As a sanity check, let’s use a couple of tidyverse table-munging tricks to make sure the tree sequence does contain a set of samples which matches our intended sampling schedule (particularly the time series of European individuals and the two Neanderthals):
library(dplyr)
ts_samples(ts) %>% nrow
[1] 38
ts_samples(ts) %>% filter(pop == "EUR") %>% pull(time)
 [1] 40000 38000 36000 34000 32000 30000 28000 26000 24000 22000 20000 18000
[13] 16000 14000 12000 10000 8000 6000 4000 2000 0 0 0 0
[25] 0 0 0 0 0 0
ts_samples(ts) %>% filter(pop == "NEA") %>% pull(time)
[1] 70000 40000
ts_samples(ts) %>% group_by(pop, time == 0) %>% tally %>% select(pop, n)
# A tibble: 5 × 2
# Groups: pop [4]
pop n
<chr> <int>
1 AFR 5
2 CHIMP 1
3 EUR 20
4 EUR 10
5 NEA 2
Everything looks good! Having made sure that the ts object contains the individuals we want, let’s move to the exercise.
Part 1: Computing nucleotide diversity
The toy model of ancient human history plotted above makes a fairly clear prediction about the nucleotide diversity expected in the simulated populations. Compute the nucleotide diversity in all populations using the slendr function ts_diversity(). Do you get the results you would expect from the model?
Hint: Nearly every slendr statistic function interfacing with tskit accepts a ts tree-sequence object as its first argument, with further arguments being either a vector of individual names representing a group of samples to compute a statistic on, or a (named) list of such vectors (each element of that list representing a group of samples) – these lists are intended to be equivalent to the sample_sets = argument of many tskit Python methods (which you’ve learned about in your activity on tskit), except that they allow symbolic names of individuals, rather than integer indices of nodes in a tree sequence.
Although you can get all the above information by processing the table produced by the ts_samples() function, slendr provides a useful helper ts_names() which only returns the names of individuals.
When you call it directly, you get a plain vector of individual names:
ts_names(ts)
 [1] "NEA_1" "EUR_1" "NEA_2" "EUR_2" "EUR_3" "EUR_4" "EUR_5"
[8] "EUR_6" "EUR_7" "EUR_8" "EUR_9" "EUR_10" "EUR_11" "EUR_12"
[15] "EUR_13" "EUR_14" "EUR_15" "EUR_16" "EUR_17" "EUR_18" "EUR_19"
[22] "EUR_20" "AFR_1" "AFR_2" "AFR_3" "AFR_4" "AFR_5" "CHIMP_1"
[29] "EUR_21" "EUR_22" "EUR_23" "EUR_24" "EUR_25" "EUR_26" "EUR_27"
[36] "EUR_28" "EUR_29" "EUR_30"
This is not super helpful, unless we want to compute some statistic for everyone in the tree sequence, regardless of their population assignment. Perhaps a bit more useful is to call the function like this, because it will produce a result which can be immediately used as the sample_sets = argument mentioned in the Hint above:
ts_names(ts, split = "pop")
$AFR
[1] "AFR_1" "AFR_2" "AFR_3" "AFR_4" "AFR_5"
$CHIMP
[1] "CHIMP_1"
$EUR
[1] "EUR_1" "EUR_2" "EUR_3" "EUR_4" "EUR_5" "EUR_6" "EUR_7" "EUR_8"
[9] "EUR_9" "EUR_10" "EUR_11" "EUR_12" "EUR_13" "EUR_14" "EUR_15" "EUR_16"
[17] "EUR_17" "EUR_18" "EUR_19" "EUR_20" "EUR_21" "EUR_22" "EUR_23" "EUR_24"
[25] "EUR_25" "EUR_26" "EUR_27" "EUR_28" "EUR_29" "EUR_30"
$NEA
[1] "NEA_1" "NEA_2"
As you can see, this gave us a normal R list, with each element containing a vector of individual names in a population. Note that we can use standard R list indexing to get subsets of individuals:
names <- ts_names(ts, split = "pop")
names["NEA"]
$NEA
[1] "NEA_1" "NEA_2"
names[c("EUR", "NEA")]
$EUR
[1] "EUR_1" "EUR_2" "EUR_3" "EUR_4" "EUR_5" "EUR_6" "EUR_7" "EUR_8"
[9] "EUR_9" "EUR_10" "EUR_11" "EUR_12" "EUR_13" "EUR_14" "EUR_15" "EUR_16"
[17] "EUR_17" "EUR_18" "EUR_19" "EUR_20" "EUR_21" "EUR_22" "EUR_23" "EUR_24"
[25] "EUR_25" "EUR_26" "EUR_27" "EUR_28" "EUR_29" "EUR_30"
$NEA
[1] "NEA_1" "NEA_2"
etc.
Many of the following exercises will use these kinds of tricks to instruct various slendr / tskit functions to compute statistics on subsets of all individuals sub-sampled in this way.
After you have computed the nucleotide diversity per population, compute it for each individual separately using the same function ts_diversity(), which then gives you the heterozygosity of each individual.
Hint: You can do this by giving a vector of names as sample_sets = (so not an R list of vectors). Feel free to take a look at the solution if you find this distinction confusing.
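As a sketch of this distinction (the calls below assume the per-population list of names created with ts_names() as shown above, and hedge on the exact behavior described in ?ts_diversity):

```r
names <- ts_names(ts, split = "pop")

# A named list of vectors -> one diversity value per population
ts_diversity(ts, sample_sets = names)

# A plain vector of names -> one value (heterozygosity) per individual
ts_diversity(ts, sample_sets = names$EUR)
```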
Part 2: Computing pairwise divergence
Use the function ts_divergence() to compute genetic divergence between all pairs of populations. Do you get results compatible with our demographic model?
Hint: Again, you can use the same concept of sample_sets = we discussed in the previous part. In this case, the function computes pairwise divergence between each element of the list given as sample_sets = (i.e., for each vector of individual names).
Part 3: Detecting Neanderthal admixture in Europeans
Let’s now pretend it’s about 2008, we’ve sequenced the first Neanderthal genome, and we are working on a project that will change human evolution research forever. We also have the genomes of a couple of people from Africa and Europe, which we want to use to answer the most burning question of all evolutionary anthropology: “Do some people living today carry Neanderthal ancestry?”
Earlier you’ve learned about \(f\)-statistics of various kinds. You have also heard that an \(f_4\) statistic (or its equivalent \(D\) statistic) can be used as a test of “treeness”. Simply speaking, for some “quartet” of individuals or population samples, they can be used as a hypothesis test of whether the history of those samples is compatible with there not having been an introgression.
Compute the \(f_4\) test of Neanderthal introgression in EUR individuals using the slendr function ts_f4(). When you’re running it, you will have to provide individuals to compute the statistic on using a slightly different format. Take a look at the help page available as ?ts_f4 for more information. When you’re computing the \(f_4\), make sure to set mode = "branch" argument of the ts_f4() function (we will get to why a bit later).
Hint: If you haven’t learned this in your \(f\)-statistics lecture, you want to compute (and compare) the values of these two statistics using the slendr function ts_f4():
\(f_4\)(<some African>, <another African>; <Neanderthal>, <Chimp>)
\(f_4\)(<some African>, <a test European>; <Neanderthal>, <Chimp>)
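In slendr code, these two statistics could be computed along these lines (a sketch; the individual names below match the cached tree sequence above, but any African/European/Neanderthal/chimp names from ts_names() will do):

```r
# Baseline: two Africans against a Neanderthal and the chimp outgroup
ts_f4(ts, W = "AFR_1", X = "AFR_2", Y = "NEA_1", Z = "CHIMP_1", mode = "branch")

# Test: an African and a European -- expected negative under introgression
ts_f4(ts, W = "AFR_1", X = "EUR_1", Y = "NEA_1", Z = "CHIMP_1", mode = "branch")
```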
Roughly speaking, we can understand this computation in terms of comparing the counts of so-called BABA and ABBA allele patterns between the quartet of samples specified in the statistic:
\[ f_4(AFR, X; NEA, CHIMP) = \frac{\#BABA - \#ABBA}{\#SNPs} \]
The first is not expected to give values “too different” from 0, because we don’t expect two African individuals to differ “significantly” in terms of how many alleles they share with a Neanderthal. The other should – if there was Neanderthal introgression into Europeans some time in their history – be “significantly negative”.
Is the second of those two statistics “much more negative” than the first, as expected assuming introgression from Neanderthals into Europeans?
In the text right above, why am I putting “significantly” and “much more negative” in quotes? What are we missing here for this being a true hypothesis test as you might be accustomed to from computing \(f\)-statistics using a tool such as ADMIXTOOLS? (We will get to this again in the following part of this exercise.)
Part 4: Detecting Neanderthal admixture in Europeans v2.0
The fact that we don’t get something equivalent to a p-value in these kinds of simulations is generally not a problem, because we’re often interested in establishing the trend of a statistic under various conditions, and in understanding when and how it behaves in a certain way. If statistical noise is a problem, we can easily work around it by computing a statistic on multiple simulation replicates, or even by increasing the sample sizes. I used this approach quite successfully on a related problem in this paper.
On top of that, the p-value of something like an \(f\)-statistic (whether it’s significantly different from zero) is also strongly affected by the quality of the data, sequencing errors, coverage, etc. (all of which can certainly be modeled using simulations!). However, these are orthogonal issues which have little to do with the underlying evolutionary model in question anyway – and it’s the investigation of expected patterns of behavior of popgen statistics that we’re after in our exercises.
In short, what counts as “significantly different from zero compared to some baseline expectation” can easily be drowned in noise in simple single-individual comparisons such as the ones we did above. Let’s increase the sample size a bit to see if a statistical pattern becomes more apparent.
Compute the first \(f_4\) statistic of the baseline expectation between a pair of Africans and the second \(f_4\) statistic comparing an African and a European, but this time on all recorded Africans and all recorded Europeans, respectively. Plot the distributions of those two sets of statistics. This should remove lots of the uncertainty and make a statistical trend stand out much more clearly.
Hint: Whenever you need to compute something for many things in sequence, looping is very useful. One way to compute, say, an \(f_4\) statistic over many individuals is to use this kind of pattern with R’s looping function lapply():
# Loop over vector of individual names (variable x) and apply a given ts_f4()
# expression on each individual (note the ts_f4(..., X = x, ...) in the code)
list_f4 <- lapply(
c("ind_1", "ind_2", ...),
function(x) ts_f4(ts, W = "AFR_1", X = x, Y = "NEA_1", Z = "CHIMP_1", mode = "branch")
)
# The above gives us a list of data frames, so we need to bind them all into a
# single table for easier interpretation and visualization
df_f4 <- do.call(rbind, list_f4)
Exercise 3: Simulation-based inference of \(N_e\)
What we have done so far is learn how slendr models provide an easy way to encode demographic models in R and simulate (even very large!) tree sequences from them. This allows us to very quickly verify our intuition about some popgen problem (things like “Hmmm, I wonder what an \(f_4\) statistic would look like if my model included this particular gene-flow event?”), in just a few lines of R. We have even been able to answer some questions like this directly in a meeting, pretty much on the spot! This makes slendr a very powerful “popgen calculator”.
Now let’s take things one step forward. Imagine you gathered some empirical data, like an allele frequency spectrum (AFS) from a population that you study. That data was, in the real world, produced by some (hidden) biological process (demographic history) that we generally don’t know anything about a priori. For instance, the population we study had some \(N_e\), which we don’t know the value of – the only thing we have is the observed AFS.
Simulations can be a great tool to estimate the most likely value of such an unknown parameter. Briefly speaking, and staying with this \(N_e\) and AFS example, we can simulate a large number of AFS vectors (each resulting from a different assumed \(N_e\) value) and then pick just those \(N_e\) values (or just one \(N_e\) value) which produced a simulated AFS closest to the observed AFS. This is exactly what you’ll be doing just now in Exercise 3.
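The idea can be sketched as a simple grid search (a hypothetical rejection-style approach, assuming the simulate_afs() function you will write in Part 1 below and the observed AFS vector afs_observed given in Part 2; Euclidean distance is just one reasonable choice of closeness):

```r
# Hypothetical sketch: simulate an AFS for each candidate Ne on a grid,
# then pick the Ne whose simulated AFS is closest to the observed AFS
ne_grid <- seq(1000, 30000, by = 1000)

errors <- sapply(ne_grid, function(ne) {
  afs_simulated <- simulate_afs(Ne = ne)
  sqrt(sum((afs_simulated - afs_observed)^2))  # Euclidean distance
})

ne_grid[which.min(errors)]  # the best-fitting candidate Ne
```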
Part 1: A self-contained slendr function of \(N_e \rightarrow \textrm{AFS}\)
In a new script exercise3.R, write a custom R function called simulate_afs(), which will take Ne as its only parameter. Use this function to compute (and return) AFS vectors for a couple of Ne values of your choosing, staying between Ne = 1000 and Ne = 30000. Plot those AFS vectors and observe how (and why?) they differ based on the Ne parameter you used in each respective simulation.
Hint: The function should create a one-population forward-time model (our population starting at time = 1, with the model simulation_length of 100000 and generation_time of 1), simulate a 10Mb tree sequence (recombination and mutation rates of 1e-8), compute the AFS for 10 samples, and return that AFS vector as its result.
Hint: If you’ve never programmed before, the concept of a “custom function” might be very alien to you. If you need help, feel free to start building your exercise3.R solution based on this “template” (just fill in missing relevant bits of slendr code that you should be already familiar with):
library(slendr)
init_env()
simulate_afs <- function(Ne) {
# In here you should write code which will compile a single-population
# demographic model based on the parameters specified above. It should
# then simulate 10Mb of tree sequence using `msprime()`, compute an AFS
# vector using the function `ts_afs()` for any subset of 10 individuals
# in the tree sequence, and finally return that vector.
return(result) # `result` is a variable with your 10-sample AFS, return it
}
afs_1 <- simulate_afs(Ne = 1000) # simulate AFS from a Ne = 1000 model...
plot(afs_1, type = "o") # ... and plot it
Note: Remember that you should drop the first element of the AFS vector produced by ts_afs() (for instance with something like result[-1], if result contains the output of ts_afs()) for technical reasons related to tskit. You don't have to worry about that here, but you can read this for more detail.
When used in R, your function should work like this:
simulate_afs(Ne = 1000)
 [1] 4000 2000 1333 1000  800  667  571  500  444  400  364  333  308  286  267
[16]  250  235  222  211  200
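For reference, one possible implementation of the whole function might look like the sketch below. Treat it as just one way of filling in the template above, not the canonical solution – the population name "pop" is arbitrary, and the exact AFS values you get will vary stochastically between runs:

```r
library(slendr)
init_env()

simulate_afs <- function(Ne) {
  # single-population forward-time model, as specified in the hint above
  pop <- population("pop", time = 1, N = Ne)
  model <- compile_model(pop, generation_time = 1, simulation_length = 100000)

  # record 10 individuals at the very end of the simulation
  samples <- schedule_sampling(model, times = 100000, list(pop, 10))

  # simulate a 10 Mb tree sequence and overlay neutral mutations
  ts <- msprime(model, samples = samples, sequence_length = 10e6,
                recombination_rate = 1e-8) %>%
    ts_mutate(mutation_rate = 1e-8)

  # compute the AFS across all recorded samples, dropping the first
  # element of the vector (the tskit technicality mentioned above)
  result <- ts_afs(ts)[-1]

  return(result)
}
```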
Part 2: Estimating unknown \(N_e\) from empirical AFS
Imagine you sequenced 10 samples from a population and computed the following AFS vector (which contains, sequentially, the number of singletons, doubletons, etc., in your sample from a population):
afs_observed <- c(2520, 1449, 855, 622, 530, 446, 365, 334, 349, 244,
264, 218, 133, 173, 159, 142, 167, 129, 125, 143)
You know (maybe from some fossil evidence) that the population probably had a constant \(N_e\) somewhere between 1000 and 30000 for the past 100,000 generations, and had mutation and recombination rates of 1e-8 (i.e., parameters already implemented by your simulate_afs() function).
Use slendr simulations to guess the true (and hidden!) \(N_e\) given the observed AFS by running simulations for a range of \(N_e\) values and comparing each run to the afs_observed vector above.
Hint: Depending on your level of comfort with R (and programming in general), you can choose one of the following approaches:
Option 1 [easy]: Plot AFS vectors for various \(N_e\) values, then eyeball which looks closest to the observed AFS based on the figures alone.
Option 2 [hard]: Simulate AFS vectors across a grid of possible Ne values (maybe using lapply()?), and find the \(N_e\) which gives the AFS closest to the observed AFS based on the mean squared error.
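As a sketch of Option 2 (assuming the simulate_afs() function from Part 1 and the afs_observed vector above; because the simulations are stochastic, repeated runs will give slightly different answers):

```r
# candidate Ne values spanning the plausible range
Ne_grid <- seq(1000, 30000, by = 1000)

# simulate one AFS vector for each candidate Ne (this can take a while)
afs_grid <- lapply(Ne_grid, simulate_afs)

# mean squared error of each simulated AFS against the observed AFS
mse <- sapply(afs_grid, function(afs) mean((afs - afs_observed)^2))

# candidate Ne whose simulated AFS is closest to the observed AFS
Ne_grid[which.min(mse)]
```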
Exercise 4: Simulating dynamics of positive selection
The primary motivation for designing slendr was to make demographic modelling in R as trivially easy and fast as possible, focusing exclusively on neutral models. However, as slendr became popular, people have been asking for the possibility of simulating natural selection. After all, a large part of slendr’s functionality deals with population genetic models across geographical landscapes, which requires SLiM. Even though these models are neutral, SLiM is the most famous and popular tool for simulating various aspects of natural selection.
Recently I caved in and added support for modifying slendr demographic models with bits of SLiM code, which allows simulating pretty much any arbitrary selection scenario you might be interested in. This exercise is a quick demonstration of how this works and how you might simulate selection using slendr. We will do this using another toy model of ancient human history, which we will use as a basis for simulating the frequency trajectory of an allele under positive selection, and implementing a simple selection scan summary statistic using Tajima's D.
To speed things up, create a new exercise4.R script and copy the following code as a starting point for this exercise:
library(slendr)
init_env(quiet = TRUE)
# This line sources a script in which I provide a few useful helper functions
# which you can use in this exercise
source(here::here("utils.R"))
# African ancestral population
afr <- population("AFR", time = 65000, N = 5000)
# First migrants out of Africa
ooa <- population("OOA", parent = afr, time = 60000, N = 5000, remove = 27000)
# Eastern hunter-gatherers
ehg <- population("EHG", parent = ooa, time = 28000, N = 5000, remove = 6000)
# European population
eur <- population("EUR", parent = ehg, time = 25000, N = 5000)
# Anatolian farmers
ana <- population("ANA", time = 28000, N = 5000, parent = ooa, remove = 4000)
# Yamnaya steppe population
yam <- population("YAM", time = 8000, N = 5000, parent = ehg, remove = 2500)
# Define gene-flow events
gf <- list(
gene_flow(from = ana, to = yam, rate = 0.75, start = 7500, end = 6000),
gene_flow(from = ana, to = eur, rate = 0.5, start = 6000, end = 5000),
gene_flow(from = yam, to = eur, rate = 0.6, start = 4000, end = 3500)
)
# Compile all populations into a single slendr model object
model <- compile_model(
populations = list(afr, ooa, ehg, eur, ana, yam),
gene_flow = gf, generation_time = 30
)
# Schedule the sampling from four European populations roughly before their
# disappearance (or before the end of the simulation)
schedule <- rbind(
schedule_sampling(model, times = 0, list(eur, 50)),
schedule_sampling(model, times = 6000, list(ehg, 50)),
schedule_sampling(model, times = 4000, list(ana, 50)),
schedule_sampling(model, times = 2500, list(yam, 50))
)
Next, visualize the demographic model. If you have done a bit of work in human population genetics, you might recognize it as a very simplified model of the demographic history of Europe over the past 50 thousand years or so. As you can see, we are recording 50 individuals from four populations – for Europeans we sample 50 individuals at "present-day", while for the remaining populations we record 50 individuals just before their disappearance. Also note that there's quite a bit of gene flow! This is an important thing we've learned about human history over the past 10 years or so – everyone is mixed with pretty much everyone, and there isn't (and never was) any such thing as a "pure population".
plot_model(model, proportions = TRUE, samples = schedule)
Part 1: Simulating a tree sequence and computing Tajima's D
Although the point of this exercise is to simulate selection, let's first simulate a normal, neutral model using slendr's msprime() engine as a sanity check. Simulate 10 Mb of sequence with a recombination rate of 1e-8 and the sampling schedule defined above. Let's not worry about adding any mutations, just to change things up a little bit – we'll be working with branch-based statistics here (which means adding mode = "branch" whenever we compute a statistic such as Tajima's D).
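This simulation step might look something like the following (a sketch, reusing the model and schedule objects defined above):

```r
# neutral sanity-check simulation of the model defined above;
# no mutation rate, since we will use branch-based statistics
ts <- msprime(model, sequence_length = 10e6, recombination_rate = 1e-8,
              samples = schedule)
```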
Inspect the table of all individuals recorded in our tree sequence using the function ts_samples(), making sure we have all the individuals scheduled for tree-sequence recording.
As you’ve learned, tskit functions in slendr generally operate on vectors (or lists) of individual names, like those produced by ts_names(). Get a vector of names of individuals in every population recorded in the tree sequence, then use this to compute Tajima’s D using the slendr function ts_tajima(). Do you see any striking differences in the Tajima’s D values across populations? Check this for some general guidance.
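One possible sketch of this computation (ts_names() with its split = "pop" argument returns a named list of individual names, one character vector per population):

```r
# list of individual names, one character vector per population
sample_names <- ts_names(ts, split = "pop")

# genome-wide, branch-based Tajima's D for each population
tajima <- ts_tajima(ts, sample_sets = sample_names, mode = "branch")
tajima
```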
Part 2: Computing Tajima’s D in windows
Let’s take this one step further. Even if there is a locus under positive selection somewhere along our chromosome, it might be quite unlikely that we would find a significant Tajima’s D value for the entire chromosome (which is basically what we did in Part 1). Fortunately, thanks to the flexibility of the tskit module, the slendr function ts_tajima() has an argument windows =, which allows us to specify the coordinates of windows into which the sequence should be broken, with Tajima’s D computed separately for each window. Perhaps this will allow us to see the impact of positive selection?
Define a variable windows which will contain a vector of coordinates of 100 windows, starting at position 0 and ending at position 10e6 (i.e., the end of our chromosome). Then provide this variable as the windows = argument of ts_tajima() on a new, separate line of your script. Save the result of ts_tajima() into the variable tajima_wins, and inspect its contents in the R console.
Hint: You can use the R function seq() and its argument length.out = 100, to create the coordinates of window boundaries very easily.
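Putting the hint together with the previous step, a sketch might look like this (assuming the per-population list of individual names, here called sample_names, which you created in Part 1):

```r
# boundaries of 100 windows along the 10 Mb chromosome
windows <- as.integer(seq(0, 10e6, length.out = 100))

# window-based, branch-mode Tajima's D in each population
tajima_wins <- ts_tajima(ts, sample_sets = sample_names, mode = "branch",
                         windows = windows)
```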
The default output format of ts_tajima() is not super user-friendly. Process the result using a helper function process_tajima(tajima_wins) that I provided for you (perhaps save it as tajima_df), and visualize it using another of my helper functions plot_tajima(tajima_df).
Note: Making the process_tajima() and plot_tajima() function available in your R code is the purpose of the source(here::here("utils.R")) command at the beginning of your script for this exercise.
Part 3: Adding positive selection to the base demographic model
Although primarily designed for neutral demographic models, slendr optionally allows simulating natural selection by providing a “SLiM extension snippet” with custom SLiM code via the optional argument extension = of compile_model() (a function you’re closely familiar with at this point).
Unfortunately we don’t have any space to explain SLiM here (and I have no idea, at the time of writing, whether or not you will have worked with SLiM earlier in this workshop). Suffice it to say that SLiM is another very popular population genetic simulator which allows the simulation of selection, and which requires you to write custom code in a different programming language called Eidos.
Take a look at the file slim_extension.txt provided in your working directory (it’s also part of the GitHub repository here). If you have worked with SLiM before, glance through the script casually and see if it makes any sense to you. If you have not worked with SLiM before, look for the strange {elements} in curly brackets in the first ten lines of the script. Those are the parameters of the selection model with which we will be customizing our standard neutral demographic model in the next step.
Specifically, when you inspect the slim_extension.txt file, you can see that this “SLiM extension script” I provided for you has three parameters:
origin_pop – in which population should a beneficial allele appear,
s – what should be the selection coefficient of the beneficial allele, and
onset_time – at which time should the allele appear in the origin_pop.
However, at the start, the SLiM extension snippet doesn’t contain any concrete values of those parameters, only their {{origin_pop}}, {{s}}, and {{onset_time}} placeholders.
Use the slendr function substitute_values() to substitute concrete values for those parameters like this:
extension <- substitute_values(
template = here::here("slim_extension.txt"),
origin_pop = "EUR",
s = 0.15,
onset_time = 12000
)
extension
[1] "/var/folders/70/b_q2zdh116b9pfg29p03sx600000gn/T//Rtmpxa9vCa/fileab7f6dda2259"
You can see that substitute_values() returned a path to a file. Take a look at that file in your terminal – you should see each of the three {placeholder} parameters replaced with a concrete given value.
And that’s all the extra work we need to turn our purely neutral demographic slendr model into a model which includes natural selection! (In this case, only simple selection acting on a single locus, as you’ll see later, but this can be generalized to any imaginable selection scenario.)
How do we use the SLiM extension for our simulation? It’s very simple – we just have to provide the extension variable as an additional argument of good old compile_model(). This will compile a new slendr model which will now include the new functionality for simulating natural selection:
Compile a new model of the history of populations afr, ooa, ehg, etc., by following the instructions above, providing a new extension = argument to the compile_model() function.
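Concretely, the recompilation is identical to the earlier compile_model() call, just with the extension = argument added (assuming the extension variable created by substitute_values() above):

```r
# recompile the model, now including the SLiM extension code
model <- compile_model(
  populations = list(afr, ooa, ehg, eur, ana, yam),
  gene_flow = gf, generation_time = 30,
  extension = extension  # <- the only new bit compared to the neutral model
)
```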
Part 4: Running a selection simulation using slim()
Now we can finally run our selection simulation!
There are two modifications to our previous simulation workflows:
- Because we need to run a non-neutral simulation, we have to switch from using the msprime() slendr engine to slim(). The latter can still interpret the same demographic model we programmed in R, just like the msprime() engine can, but will run the model using SLiM (thus leveraging the new SLiM extension code that we customized using substitute_values() above). We simply do this by switching from this:
ts <- msprime(model, sequence_length = 10e6, recombination_rate = 1e-8, samples = schedule)
to this:
ts <- slim(model, sequence_length = 10e6, recombination_rate = 1e-8, samples = schedule)
As you can see, you don’t have to modify anything in your model code – just switch from msprime to slim in the line of code which produces the simulation result.
- The customized model will not only produce a tree sequence, but will also generate a table of allele frequencies in each population (SLiM experts might have noticed the relevant SLiM code when they were inspecting slim_extension.txt). We need to be able to load both of these files after the simulation, and thus need a path to a location where we can find them. We can do this by calling the slim() function as path <- slim(..., path = TRUE) (so with the extra path = argument). This will return a path to the directory where the slim() engine saved all files with our desired results.
Run a simulation from the modified selection model with the slim() engine as instructed in points 1 and 2 above, then use the list.files(path) function in R to take a look in the directory. Which files were produced by the simulation?
Part 5: Investigating allele frequency trajectories
Use another helper function read_trajectory(path) which I provided for this exercise to read the simulated frequency trajectories of the positively selected mutation in all of our populations into a variable traj_df. Then run a second helper function plot_trajectory(traj_df) to inspect the trajectories visually.
Recall that you used the function substitute_values() to parametrize your selection model so that the allele under selection appears in Europeans 12 thousand years ago, and is programmed to be under very strong selection of \(s = 0.15\). Do the trajectories visualized by plot_trajectory() make sense given the demographic model of European prehistory plotted above?
Part 6: Tajima’s D (genome-wide and window-based) from the selection model
Recall that your simulation run saved results in the location stored in the path variable:
list.files(path)
[1] "slim.trees"     "trajectory.tsv"
From this path, we’ve already successfully investigated the frequency trajectories.
Now let’s compute Tajima’s D on the tree sequence simulated from our selection model. Hopefully we should see an interesting pattern in our selection scan – after all, we don’t yet know where in the genome the putative locus under selection lies!
To read a tree sequence simulated with slim() by our customized selection setup, we need to do a bit of work. To simplify things, here’s the R code which makes this possible. Just copy it into your exercise4.R script as it is:
# Let's use my own saved simulation results, so that we're all on the
# same page going forward
path <- here::here("data/selection")
ts <-
file.path(path, "slim.trees") %>% # 1. compose full path to the slim.trees file
ts_read(model) %>% # 2. read the tree sequence file into R
ts_recapitate(Ne = 5000, recombination_rate = 1e-8) # 3. perform recapitation
Very briefly, because our tree sequence was generated by SLiM, it’s very likely that not all genealogies along the simulated genome will be fully coalesced (i.e., not every tree will have a single root). Explaining why this is the case is out of the scope of this session, but read here if you’re interested in learning more. For the time being, it suffices to say that we can pass the (uncoalesced) tree sequence into the ts_recapitate() function, which takes a SLiM tree sequence and simulates all the necessary “ancestral history” that was missing from the uncoalesced trees, thus ensuring that the entire tree sequence is fully coalesced and that statistics can be correctly computed on it.
Now that you have a ts tree sequence object resulting from a new selection simulation run, repeat the analyses of genome-wide and window-based Tajima’s D from Part 1 and Part 2 of this exercise, again using the provided helper functions process_tajima() and plot_tajima(). Can you identify which locus has been the likely focal point of the positive selection? Which population shows evidence of selection? Which doesn’t and why (look again at the visualization of the demographic model above)?
Exercise 5: Programming of your own model
Do you have a favourite model organism, or do you study another species? Take some of the most important features of its demographic history and program them as a new slendr model.
Then, think about some interesting population genetic statistics that you either know from the literature or that you perhaps computed on your own empirical data. Try to compute them on a tree sequence simulated from your model using msprime(). Do you get results which are in the ballpark of your expectations?